Effects of OCR errors on ranking and feedback using the vector space model
Internal identifier: 002682 (Main/Exploration); previous: 002681; next: 002683
Authors: Kazem Taghva [United States]; Julie Borsack [United States]; Allen Condit [United States]
Source:
- Information Processing and Management [0306-4573]; 1996.
French descriptors
- Pascal (Inist)
- Wicri :
- topic : Essai, Recherche documentaire, Système documentaire.
English descriptors
- KwdEn :
Abstract
We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall are not affected for our full-text document collection when the OCR version is compared to its corresponding corrected set. We do, however, see divergence between the relevant document rankings of the OCR and corrected collections with different weighting combinations. In particular, we observed that cosine normalization plays a considerable role in the disparity seen between the collections. Furthermore, we show that even though feedback improves retrieval for both collections, it cannot be used to compensate for OCR errors caused by badly degraded documents.
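The abstract does not give the weighting formulas used in the study. As a rough illustration of why cosine normalization could make rankings diverge between an OCR collection and its corrected counterpart, the sketch below assumes raw term-frequency weights and uses made-up strings; the junk tokens and the misread word `rnodel` are hypothetical, and this is one plausible mechanism rather than the authors' stated explanation.

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def inner_product(query, doc):
    """Unnormalized similarity: sum of products of matching term weights."""
    return sum(weight * doc[term] for term, weight in query.items())


def cosine(query, doc):
    """Inner product divided by both vector lengths (cosine normalization)."""
    q_len = math.sqrt(sum(w * w for w in query.values()))
    d_len = math.sqrt(sum(w * w for w in doc.values()))
    if q_len == 0 or d_len == 0:
        return 0.0
    return inner_product(query, doc) / (q_len * d_len)


query = tf_vector("vector space retrieval")
corrected = tf_vector("vector space model for document retrieval")
# Hypothetical OCR output: "model" is misread and three junk tokens are added.
ocr = tf_vector("vector space rnodel for document retrieval ~;l |l 1l")

# The query terms survive OCR, so the unnormalized scores are identical.
print(inner_product(query, corrected), inner_product(query, ocr))
# The spurious terms lengthen the OCR vector, so its cosine score drops.
print(round(cosine(query, corrected), 3), round(cosine(query, ocr), 3))
```

In this toy case the unnormalized score is the same for both versions, while the cosine score of the OCR version is lower because the spurious terms increase the document vector length.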
Url: https://api.istex.fr/document/2022C26E3682F8C2CDD3580811393DEEE55E8CA8/fulltext/pdf
DOI: 10.1016/0306-4573(95)00058-5
Affiliations: Information Science Research Institute, University of Nevada, Las Vegas, Nev. (all three authors)
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 000017
- to stream Istex, to step Curation: 000017
- to stream Istex, to step Checkpoint: 001A93
- to stream Main, to step Merge: 002826
- to stream PascalFrancis, to step Corpus: 000A07
- to stream PascalFrancis, to step Curation: 000991
- to stream PascalFrancis, to step Checkpoint: 000955
- to stream Main, to step Merge: 002A42
- to stream Main, to step Curation: 002682
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title>Effects of OCR errors on ranking and feedback using the vector space model</title>
<author><name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
</author>
<author><name sortKey="Borsack, Julie" sort="Borsack, Julie" uniqKey="Borsack J" first="Julie" last="Borsack">Julie Borsack</name>
</author>
<author><name sortKey="Condit, Allen" sort="Condit, Allen" uniqKey="Condit A" first="Allen" last="Condit">Allen Condit</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:2022C26E3682F8C2CDD3580811393DEEE55E8CA8</idno>
<date when="1996" year="1996">1996</date>
<idno type="doi">10.1016/0306-4573(95)00058-5</idno>
<idno type="url">https://api.istex.fr/document/2022C26E3682F8C2CDD3580811393DEEE55E8CA8/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000017</idno>
<idno type="wicri:Area/Istex/Curation">000017</idno>
<idno type="wicri:Area/Istex/Checkpoint">001A93</idno>
<idno type="wicri:doubleKey">0306-4573:1996:Taghva K:effects:of:ocr</idno>
<idno type="wicri:Area/Main/Merge">002826</idno>
<idno type="wicri:source">INIST</idno>
<idno type="RBID">Pascal:96-0295002</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000A07</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000991</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000955</idno>
<idno type="wicri:doubleKey">0306-4573:1996:Taghva K:effects:of:ocr</idno>
<idno type="wicri:Area/Main/Merge">002A42</idno>
<idno type="wicri:Area/Main/Curation">002682</idno>
<idno type="wicri:Area/Main/Exploration">002682</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a">Effects of OCR errors on ranking and feedback using the vector space model</title>
<author><name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
<affiliation wicri:level="1"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Borsack, Julie" sort="Borsack, Julie" uniqKey="Borsack J" first="Julie" last="Borsack">Julie Borsack</name>
<affiliation wicri:level="1"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Condit, Allen" sort="Condit, Allen" uniqKey="Condit A" first="Allen" last="Condit">Allen Condit</name>
<affiliation wicri:level="1"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Information Science Research Institute, University of Nevada, Las Vegas, Nev.</wicri:regionArea>
<wicri:noRegion>Nev.</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Information Processing and Management</title>
<title level="j" type="abbrev">IPM</title>
<idno type="ISSN">0306-4573</idno>
<imprint><publisher>ELSEVIER</publisher>
<date type="published" when="1996">1996</date>
<biblScope unit="volume">32</biblScope>
<biblScope unit="issue">3</biblScope>
<biblScope unit="page" from="317">317</biblScope>
<biblScope unit="page" to="327">327</biblScope>
</imprint>
<idno type="ISSN">0306-4573</idno>
</series>
<idno type="istex">2022C26E3682F8C2CDD3580811393DEEE55E8CA8</idno>
<idno type="DOI">10.1016/0306-4573(95)00058-5</idno>
<idno type="PII">0306-4573(95)00058-5</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0306-4573</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="KwdEn" xml:lang="en"><term>Character recognition</term>
<term>Document retrieval</term>
<term>Document retrieval system</term>
<term>Error</term>
<term>Full text</term>
<term>Influence</term>
<term>Optical reading</term>
<term>Test</term>
<term>Vector space model</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr"><term>Erreur</term>
<term>Essai</term>
<term>Influence</term>
<term>Lecture optique</term>
<term>Modèle espace vectoriel</term>
<term>Recherche documentaire</term>
<term>Reconnaissance caractère</term>
<term>Système documentaire</term>
<term>Texte intégral</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Essai</term>
<term>Recherche documentaire</term>
<term>Système documentaire</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">We report on the performance of the vector space model in the presence of OCR errors. We show that average precision and recall is not affected for our full text document collection when the OCR version is compared to its corresponding corrected set. We do see divergence though between the relevant document rankings of the OCR and corrected collections with different weighting combinations. In particular, we observed that cosine normalization plays a considerable role in the disparity seen between the collections. Furthermore, we show that even though feedback improves retrieval for both collections, it can not be used to compensate for OCR errors caused by badly degraded documents.</div>
</front>
</TEI>
<affiliations><list><country><li>États-Unis</li>
</country>
</list>
<tree><country name="États-Unis"><noRegion><name sortKey="Taghva, Kazem" sort="Taghva, Kazem" uniqKey="Taghva K" first="Kazem" last="Taghva">Kazem Taghva</name>
</noRegion>
<name sortKey="Borsack, Julie" sort="Borsack, Julie" uniqKey="Borsack J" first="Julie" last="Borsack">Julie Borsack</name>
<name sortKey="Condit, Allen" sort="Condit, Allen" uniqKey="Condit A" first="Allen" last="Condit">Allen Condit</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 002682 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 002682 | SxmlIndent | more
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:2022C26E3682F8C2CDD3580811393DEEE55E8CA8 |texte= Effects of OCR errors on ranking and feedback using the vector space model }}
This area was generated with Dilib version V0.6.32.